This notebook is a template with each step that you need to complete for the project.
Please fill in your code where there are explicit ? markers in the notebook. You are welcome to add more cells and code as you see fit.
Once you have completed all the code implementations, please export your notebook as an HTML file so the reviewers can view your code. Make sure all cell outputs are rendered correctly.
File-> Export Notebook As... -> Export Notebook as HTML
There is a writeup to complete as well after all code implementation is done. Please answer all questions and attach the necessary tables and charts. You can complete the writeup in either Markdown or PDF.
Completing the code template and writeup template will cover all of the rubric points for this project.
The rubric contains "Stand Out Suggestions" for enhancing the project beyond the minimum requirements. The stand out suggestions are optional. If you decide to pursue the "stand out suggestions", you can include the code in this notebook and also discuss the results in the writeup file.
Below is an example of the steps to get the API username and key. Each student will have their own username and key.
Open kaggle.json and use the username and key it contains.
ml.t3.medium instance (2 vCPU + 4 GiB)
Python 3 (MXNet 1.8 Python 3.7 CPU Optimized)
!pip install -U pip
!pip install -U setuptools wheel
!pip install -U "mxnet<2.0.0" bokeh==2.0.1
!pip install autogluon --no-cache-dir
# Without --no-cache-dir, smaller aws instances may have trouble installing
!pip install -U python-dotenv
!pip install -U kaggle
!pip install -U pandas-profiling
!pip install ipywidgets==7.7.2
!pip install pydantic==1.10.2
# create the .kaggle directory and an empty kaggle.json file
!mkdir -p /root/.kaggle
!touch /root/.kaggle/kaggle.json
!chmod 600 /root/.kaggle/kaggle.json
from dotenv import load_dotenv
from os import environ
load_dotenv()
True
# Fill in your user name and key from creating the kaggle account and API token file
import json
kaggle_username = environ.get("KAGGLE_USERNAME")
kaggle_key = environ.get("KAGGLE_KEY")
# Save the API token to the kaggle.json file
with open("/root/.kaggle/kaggle.json", "w") as f:
    f.write(json.dumps({"username": kaggle_username, "key": kaggle_key}))
# Download the dataset, it will be in a .zip file so you'll need to unzip it as well.
#!kaggle competitions download -c bike-sharing-demand
# If you already downloaded it you can use the -o command to overwrite the file
!unzip -o bike-sharing-demand.zip
Archive:  bike-sharing-demand.zip
  inflating: sampleSubmission.csv
  inflating: test.csv
  inflating: train.csv
import pandas as pd
from autogluon.tabular import TabularPredictor
import matplotlib.pyplot as plt
from pandas_profiling import ProfileReport
# Create the train dataset in pandas by reading the csv
# Set the parsing of the datetime column so you can use some of the `dt` features in pandas later
train = pd.read_csv("train.csv", parse_dates=["datetime"])
train.head()
| datetime | season | holiday | workingday | weather | temp | atemp | humidity | windspeed | casual | registered | count | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011-01-01 00:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 81 | 0.0 | 3 | 13 | 16 |
| 1 | 2011-01-01 01:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 8 | 32 | 40 |
| 2 | 2011-01-01 02:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 5 | 27 | 32 |
| 3 | 2011-01-01 03:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 75 | 0.0 | 3 | 10 | 13 |
| 4 | 2011-01-01 04:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 75 | 0.0 | 0 | 1 | 1 |
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 12 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   datetime    10886 non-null  datetime64[ns]
 1   season      10886 non-null  int64
 2   holiday     10886 non-null  int64
 3   workingday  10886 non-null  int64
 4   weather     10886 non-null  int64
 5   temp        10886 non-null  float64
 6   atemp       10886 non-null  float64
 7   humidity    10886 non-null  int64
 8   windspeed   10886 non-null  float64
 9   casual      10886 non-null  int64
 10  registered  10886 non-null  int64
 11  count       10886 non-null  int64
dtypes: datetime64[ns](1), float64(3), int64(8)
memory usage: 1020.7 KB
# Simple output of the train dataset to view some of the min/max/variation of the dataset features.
train.describe()
| season | holiday | workingday | weather | temp | atemp | humidity | windspeed | casual | registered | count | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 10886.000000 | 10886.000000 | 10886.000000 | 10886.000000 | 10886.00000 | 10886.000000 | 10886.000000 | 10886.000000 | 10886.000000 | 10886.000000 | 10886.000000 |
| mean | 2.506614 | 0.028569 | 0.680875 | 1.418427 | 20.23086 | 23.655084 | 61.886460 | 12.799395 | 36.021955 | 155.552177 | 191.574132 |
| std | 1.116174 | 0.166599 | 0.466159 | 0.633839 | 7.79159 | 8.474601 | 19.245033 | 8.164537 | 49.960477 | 151.039033 | 181.144454 |
| min | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 0.82000 | 0.760000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| 25% | 2.000000 | 0.000000 | 0.000000 | 1.000000 | 13.94000 | 16.665000 | 47.000000 | 7.001500 | 4.000000 | 36.000000 | 42.000000 |
| 50% | 3.000000 | 0.000000 | 1.000000 | 1.000000 | 20.50000 | 24.240000 | 62.000000 | 12.998000 | 17.000000 | 118.000000 | 145.000000 |
| 75% | 4.000000 | 0.000000 | 1.000000 | 2.000000 | 26.24000 | 31.060000 | 77.000000 | 16.997900 | 49.000000 | 222.000000 | 284.000000 |
| max | 4.000000 | 1.000000 | 1.000000 | 4.000000 | 41.00000 | 45.455000 | 100.000000 | 56.996900 | 367.000000 | 886.000000 | 977.000000 |
# Create the test pandas dataframe in pandas by reading the csv, remember to parse the datetime!
test = pd.read_csv("test.csv", parse_dates=["datetime"])
test.head()
| datetime | season | holiday | workingday | weather | temp | atemp | humidity | windspeed | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011-01-20 00:00:00 | 1 | 0 | 1 | 1 | 10.66 | 11.365 | 56 | 26.0027 |
| 1 | 2011-01-20 01:00:00 | 1 | 0 | 1 | 1 | 10.66 | 13.635 | 56 | 0.0000 |
| 2 | 2011-01-20 02:00:00 | 1 | 0 | 1 | 1 | 10.66 | 13.635 | 56 | 0.0000 |
| 3 | 2011-01-20 03:00:00 | 1 | 0 | 1 | 1 | 10.66 | 12.880 | 56 | 11.0014 |
| 4 | 2011-01-20 04:00:00 | 1 | 0 | 1 | 1 | 10.66 | 12.880 | 56 | 11.0014 |
# Same as the train and test datasets: read the sample submission csv and parse the datetime
submission = pd.read_csv("sampleSubmission.csv", parse_dates=["datetime"])
submission.head()
| datetime | count | |
|---|---|---|
| 0 | 2011-01-20 00:00:00 | 0 |
| 1 | 2011-01-20 01:00:00 | 0 |
| 2 | 2011-01-20 02:00:00 | 0 |
| 3 | 2011-01-20 03:00:00 | 0 |
| 4 | 2011-01-20 04:00:00 | 0 |
Requirements:
- The target is the `count` column, so it is the label we are setting.
- Ignore the `casual` and `registered` columns, as they are also not present in the test dataset.
- Use `root_mean_squared_error` as the metric to use for evaluation.
- Use the `best_quality` preset to focus on creating the best model.

learner_kwargs = {
    "ignored_columns": ["casual", "registered"]
}
predictor = TabularPredictor(label="count", learner_kwargs=learner_kwargs, problem_type="regression",
                             eval_metric="root_mean_squared_error").fit(train_data=train, time_limit=600, presets="best_quality")
No path specified. Models will be saved in: "AutogluonModels/ag-20221230_044222/"
Presets specified: ['best_quality']
Stack configuration (auto_stack=True): num_stack_levels=1, num_bag_folds=8, num_bag_sets=20
Beginning AutoGluon training ... Time limit = 600s
AutoGluon will save models to "AutogluonModels/ag-20221230_044222/"
AutoGluon Version: 0.6.1
Python Version: 3.7.10
Operating System: Linux
Platform Machine: x86_64
Platform Version: #1 SMP Wed Oct 26 20:36:53 UTC 2022
Train Data Rows: 10886
Train Data Columns: 11
Label Column: count
Preprocessing data ...
Using Feature Generators to preprocess the data ...
Dropping user-specified ignored columns: ['casual', 'registered']
Fitting AutoMLPipelineFeatureGenerator...
Available Memory: 3054.59 MB
Train Data (Original) Memory Usage: 0.78 MB (0.0% of available memory)
Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
Stage 1 Generators:
Fitting AsTypeFeatureGenerator...
Note: Converting 2 features to boolean dtype as they only contain 2 unique values.
Stage 2 Generators:
Fitting FillNaFeatureGenerator...
Stage 3 Generators:
Fitting IdentityFeatureGenerator...
Fitting DatetimeFeatureGenerator...
/usr/local/lib/python3.7/site-packages/autogluon/features/generators/datetime.py:59: FutureWarning: casting datetime64[ns, UTC] values to int64 with .astype(...) is deprecated and will raise in a future version. Use .view(...) instead.
good_rows = series[~series.isin(bad_rows)].astype(np.int64)
Stage 4 Generators:
Fitting DropUniqueFeatureGenerator...
Types of features in original data (raw dtype, special dtypes):
('datetime', []) : 1 | ['datetime']
('float', []) : 3 | ['temp', 'atemp', 'windspeed']
('int', []) : 5 | ['season', 'holiday', 'workingday', 'weather', 'humidity']
Types of features in processed data (raw dtype, special dtypes):
('float', []) : 3 | ['temp', 'atemp', 'windspeed']
('int', []) : 3 | ['season', 'weather', 'humidity']
('int', ['bool']) : 2 | ['holiday', 'workingday']
('int', ['datetime_as_int']) : 5 | ['datetime', 'datetime.year', 'datetime.month', 'datetime.day', 'datetime.dayofweek']
0.5s = Fit runtime
9 features in original data used to generate 13 features in processed data.
Train Data (Processed) Memory Usage: 0.98 MB (0.0% of available memory)
Data preprocessing and feature engineering runtime = 0.62s ...
AutoGluon will gauge predictive performance using evaluation metric: 'root_mean_squared_error'
This metric's sign has been flipped to adhere to being higher_is_better. The metric score can be multiplied by -1 to get the metric value.
To change this, specify the eval_metric parameter of Predictor()
AutoGluon will fit 2 stack levels (L1 to L2) ...
Fitting 11 L1 models ...
Fitting model: KNeighborsUnif_BAG_L1 ... Training model for up to 399.49s of the 599.38s of remaining time.
-101.5462 = Validation score (-root_mean_squared_error)
0.03s = Training runtime
0.1s = Validation runtime
Fitting model: KNeighborsDist_BAG_L1 ... Training model for up to 397.29s of the 597.19s of remaining time.
-84.1251 = Validation score (-root_mean_squared_error)
0.03s = Training runtime
0.1s = Validation runtime
Fitting model: LightGBMXT_BAG_L1 ... Training model for up to 396.94s of the 596.83s of remaining time.
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
-131.4609 = Validation score (-root_mean_squared_error)
65.58s = Training runtime
6.75s = Validation runtime
Fitting model: LightGBM_BAG_L1 ... Training model for up to 320.06s of the 519.95s of remaining time.
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
-131.0542 = Validation score (-root_mean_squared_error)
31.54s = Training runtime
1.4s = Validation runtime
Fitting model: RandomForestMSE_BAG_L1 ... Training model for up to 283.73s of the 483.62s of remaining time.
-116.5443 = Validation score (-root_mean_squared_error)
11.12s = Training runtime
0.55s = Validation runtime
Fitting model: CatBoost_BAG_L1 ... Training model for up to 269.42s of the 469.31s of remaining time.
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
-130.5252 = Validation score (-root_mean_squared_error)
200.14s = Training runtime
0.08s = Validation runtime
Fitting model: ExtraTreesMSE_BAG_L1 ... Training model for up to 64.9s of the 264.79s of remaining time.
-124.5881 = Validation score (-root_mean_squared_error)
6.54s = Training runtime
0.68s = Validation runtime
Fitting model: NeuralNetFastAI_BAG_L1 ... Training model for up to 54.91s of the 254.8s of remaining time.
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
-138.9149 = Validation score (-root_mean_squared_error)
74.59s = Training runtime
0.77s = Validation runtime
Completed 1/20 k-fold bagging repeats ...
Fitting model: WeightedEnsemble_L2 ... Training model for up to 360.0s of the 173.12s of remaining time.
-84.1251 = Validation score (-root_mean_squared_error)
0.73s = Training runtime
0.0s = Validation runtime
Fitting 9 L2 models ...
Fitting model: LightGBMXT_BAG_L2 ... Training model for up to 172.3s of the 172.27s of remaining time.
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
-60.4113 = Validation score (-root_mean_squared_error)
54.59s = Training runtime
3.12s = Validation runtime
Fitting model: LightGBM_BAG_L2 ... Training model for up to 112.97s of the 112.95s of remaining time.
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
-55.0656 = Validation score (-root_mean_squared_error)
25.47s = Training runtime
0.3s = Validation runtime
Fitting model: RandomForestMSE_BAG_L2 ... Training model for up to 83.4s of the 83.37s of remaining time.
-53.42 = Validation score (-root_mean_squared_error)
26.83s = Training runtime
0.62s = Validation runtime
Fitting model: CatBoost_BAG_L2 ... Training model for up to 53.51s of the 53.48s of remaining time.
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
-55.7491 = Validation score (-root_mean_squared_error)
57.02s = Training runtime
0.07s = Validation runtime
Completed 1/20 k-fold bagging repeats ...
Fitting model: WeightedEnsemble_L3 ... Training model for up to 360.0s of the -7.83s of remaining time.
-53.1144 = Validation score (-root_mean_squared_error)
0.28s = Training runtime
0.0s = Validation runtime
AutoGluon training complete, total runtime = 608.3s ... Best model: "WeightedEnsemble_L3"
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("AutogluonModels/ag-20221230_044222/")
predictor.fit_summary()
*** Summary of fit() ***
Estimated performance of each model:
model score_val pred_time_val fit_time pred_time_val_marginal fit_time_marginal stack_level can_infer fit_order
0 WeightedEnsemble_L3 -53.114374 14.542919 553.766336 0.000817 0.278653 3 True 14
1 RandomForestMSE_BAG_L2 -53.420043 11.062850 416.399112 0.618036 26.825663 2 True 12
2 LightGBM_BAG_L2 -55.065570 10.741559 415.044961 0.296744 25.471513 2 True 11
3 CatBoost_BAG_L2 -55.749057 10.512298 446.595644 0.067484 57.022196 2 True 13
4 LightGBMXT_BAG_L2 -60.411326 13.559839 444.168311 3.115024 54.594863 2 True 10
5 KNeighborsDist_BAG_L1 -84.125061 0.103688 0.029149 0.103688 0.029149 1 True 2
6 WeightedEnsemble_L2 -84.125061 0.104831 0.762941 0.001143 0.733792 2 True 9
7 KNeighborsUnif_BAG_L1 -101.546199 0.104609 0.032093 0.104609 0.032093 1 True 1
8 RandomForestMSE_BAG_L1 -116.544294 0.552854 11.122160 0.552854 11.122160 1 True 5
9 ExtraTreesMSE_BAG_L1 -124.588053 0.682034 6.536114 0.682034 6.536114 1 True 7
10 CatBoost_BAG_L1 -130.525167 0.080831 200.143895 0.080831 200.143895 1 True 6
11 LightGBM_BAG_L1 -131.054162 1.400994 31.543081 1.400994 31.543081 1 True 4
12 LightGBMXT_BAG_L1 -131.460909 6.754477 65.577447 6.754477 65.577447 1 True 3
13 NeuralNetFastAI_BAG_L1 -138.914862 0.765329 74.589510 0.765329 74.589510 1 True 8
Number of models trained: 14
Types of models trained:
{'WeightedEnsembleModel', 'StackerEnsembleModel_LGB', 'StackerEnsembleModel_CatBoost', 'StackerEnsembleModel_NNFastAiTabular', 'StackerEnsembleModel_XT', 'StackerEnsembleModel_KNN', 'StackerEnsembleModel_RF'}
Bagging used: True (with 8 folds)
Multi-layer stack-ensembling used: True (with 3 levels)
Feature Metadata (Processed):
(raw dtype, special dtypes):
('float', []) : 3 | ['temp', 'atemp', 'windspeed']
('int', []) : 3 | ['season', 'weather', 'humidity']
('int', ['bool']) : 2 | ['holiday', 'workingday']
('int', ['datetime_as_int']) : 5 | ['datetime', 'datetime.year', 'datetime.month', 'datetime.day', 'datetime.dayofweek']
Plot summary of models saved to file: AutogluonModels/ag-20221230_044222/SummaryOfModels.html
*** End of fit() summary ***
{'model_types': {'KNeighborsUnif_BAG_L1': 'StackerEnsembleModel_KNN',
'KNeighborsDist_BAG_L1': 'StackerEnsembleModel_KNN',
'LightGBMXT_BAG_L1': 'StackerEnsembleModel_LGB',
'LightGBM_BAG_L1': 'StackerEnsembleModel_LGB',
'RandomForestMSE_BAG_L1': 'StackerEnsembleModel_RF',
'CatBoost_BAG_L1': 'StackerEnsembleModel_CatBoost',
'ExtraTreesMSE_BAG_L1': 'StackerEnsembleModel_XT',
'NeuralNetFastAI_BAG_L1': 'StackerEnsembleModel_NNFastAiTabular',
'WeightedEnsemble_L2': 'WeightedEnsembleModel',
'LightGBMXT_BAG_L2': 'StackerEnsembleModel_LGB',
'LightGBM_BAG_L2': 'StackerEnsembleModel_LGB',
'RandomForestMSE_BAG_L2': 'StackerEnsembleModel_RF',
'CatBoost_BAG_L2': 'StackerEnsembleModel_CatBoost',
'WeightedEnsemble_L3': 'WeightedEnsembleModel'},
'model_performance': {'KNeighborsUnif_BAG_L1': -101.54619908446061,
'KNeighborsDist_BAG_L1': -84.12506123181602,
'LightGBMXT_BAG_L1': -131.46090891834504,
'LightGBM_BAG_L1': -131.054161598899,
'RandomForestMSE_BAG_L1': -116.54429428704391,
'CatBoost_BAG_L1': -130.52516708977194,
'ExtraTreesMSE_BAG_L1': -124.58805258915959,
'NeuralNetFastAI_BAG_L1': -138.9148618317948,
'WeightedEnsemble_L2': -84.12506123181602,
'LightGBMXT_BAG_L2': -60.41132611426569,
'LightGBM_BAG_L2': -55.06556954800326,
'RandomForestMSE_BAG_L2': -53.42004335942844,
'CatBoost_BAG_L2': -55.74905694074817,
'WeightedEnsemble_L3': -53.11437398485209},
'model_best': 'WeightedEnsemble_L3',
'model_paths': {'KNeighborsUnif_BAG_L1': 'AutogluonModels/ag-20221230_044222/models/KNeighborsUnif_BAG_L1/',
'KNeighborsDist_BAG_L1': 'AutogluonModels/ag-20221230_044222/models/KNeighborsDist_BAG_L1/',
'LightGBMXT_BAG_L1': 'AutogluonModels/ag-20221230_044222/models/LightGBMXT_BAG_L1/',
'LightGBM_BAG_L1': 'AutogluonModels/ag-20221230_044222/models/LightGBM_BAG_L1/',
'RandomForestMSE_BAG_L1': 'AutogluonModels/ag-20221230_044222/models/RandomForestMSE_BAG_L1/',
'CatBoost_BAG_L1': 'AutogluonModels/ag-20221230_044222/models/CatBoost_BAG_L1/',
'ExtraTreesMSE_BAG_L1': 'AutogluonModels/ag-20221230_044222/models/ExtraTreesMSE_BAG_L1/',
'NeuralNetFastAI_BAG_L1': 'AutogluonModels/ag-20221230_044222/models/NeuralNetFastAI_BAG_L1/',
'WeightedEnsemble_L2': 'AutogluonModels/ag-20221230_044222/models/WeightedEnsemble_L2/',
'LightGBMXT_BAG_L2': 'AutogluonModels/ag-20221230_044222/models/LightGBMXT_BAG_L2/',
'LightGBM_BAG_L2': 'AutogluonModels/ag-20221230_044222/models/LightGBM_BAG_L2/',
'RandomForestMSE_BAG_L2': 'AutogluonModels/ag-20221230_044222/models/RandomForestMSE_BAG_L2/',
'CatBoost_BAG_L2': 'AutogluonModels/ag-20221230_044222/models/CatBoost_BAG_L2/',
'WeightedEnsemble_L3': 'AutogluonModels/ag-20221230_044222/models/WeightedEnsemble_L3/'},
'model_fit_times': {'KNeighborsUnif_BAG_L1': 0.03209257125854492,
'KNeighborsDist_BAG_L1': 0.02914905548095703,
'LightGBMXT_BAG_L1': 65.57744669914246,
'LightGBM_BAG_L1': 31.543081283569336,
'RandomForestMSE_BAG_L1': 11.122159719467163,
'CatBoost_BAG_L1': 200.1438946723938,
'ExtraTreesMSE_BAG_L1': 6.53611421585083,
'NeuralNetFastAI_BAG_L1': 74.58950996398926,
'WeightedEnsemble_L2': 0.7337920665740967,
'LightGBMXT_BAG_L2': 54.594863176345825,
'LightGBM_BAG_L2': 25.471513032913208,
'RandomForestMSE_BAG_L2': 26.825663328170776,
'CatBoost_BAG_L2': 57.02219581604004,
'WeightedEnsemble_L3': 0.2786529064178467},
'model_pred_times': {'KNeighborsUnif_BAG_L1': 0.10460901260375977,
'KNeighborsDist_BAG_L1': 0.10368776321411133,
'LightGBMXT_BAG_L1': 6.754477024078369,
'LightGBM_BAG_L1': 1.400993824005127,
'RandomForestMSE_BAG_L1': 0.5528538227081299,
'CatBoost_BAG_L1': 0.08083105087280273,
'ExtraTreesMSE_BAG_L1': 0.6820335388183594,
'NeuralNetFastAI_BAG_L1': 0.7653286457061768,
'WeightedEnsemble_L2': 0.0011434555053710938,
'LightGBMXT_BAG_L2': 3.1150238513946533,
'LightGBM_BAG_L2': 0.29674410820007324,
'RandomForestMSE_BAG_L2': 0.6180357933044434,
'CatBoost_BAG_L2': 0.06748366355895996,
'WeightedEnsemble_L3': 0.0008172988891601562},
'num_bag_folds': 8,
'max_stack_level': 3,
'model_hyperparams': {'KNeighborsUnif_BAG_L1': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True,
'use_child_oof': True},
'KNeighborsDist_BAG_L1': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True,
'use_child_oof': True},
'LightGBMXT_BAG_L1': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True},
'LightGBM_BAG_L1': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True},
'RandomForestMSE_BAG_L1': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True,
'use_child_oof': True},
'CatBoost_BAG_L1': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True},
'ExtraTreesMSE_BAG_L1': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True,
'use_child_oof': True},
'NeuralNetFastAI_BAG_L1': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True},
'WeightedEnsemble_L2': {'use_orig_features': False,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True},
'LightGBMXT_BAG_L2': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True},
'LightGBM_BAG_L2': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True},
'RandomForestMSE_BAG_L2': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True,
'use_child_oof': True},
'CatBoost_BAG_L2': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True},
'WeightedEnsemble_L3': {'use_orig_features': False,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True}},
'leaderboard': model score_val pred_time_val fit_time \
0 WeightedEnsemble_L3 -53.114374 14.542919 553.766336
1 RandomForestMSE_BAG_L2 -53.420043 11.062850 416.399112
2 LightGBM_BAG_L2 -55.065570 10.741559 415.044961
3 CatBoost_BAG_L2 -55.749057 10.512298 446.595644
4 LightGBMXT_BAG_L2 -60.411326 13.559839 444.168311
5 KNeighborsDist_BAG_L1 -84.125061 0.103688 0.029149
6 WeightedEnsemble_L2 -84.125061 0.104831 0.762941
7 KNeighborsUnif_BAG_L1 -101.546199 0.104609 0.032093
8 RandomForestMSE_BAG_L1 -116.544294 0.552854 11.122160
9 ExtraTreesMSE_BAG_L1 -124.588053 0.682034 6.536114
10 CatBoost_BAG_L1 -130.525167 0.080831 200.143895
11 LightGBM_BAG_L1 -131.054162 1.400994 31.543081
12 LightGBMXT_BAG_L1 -131.460909 6.754477 65.577447
13 NeuralNetFastAI_BAG_L1 -138.914862 0.765329 74.589510
pred_time_val_marginal fit_time_marginal stack_level can_infer \
0 0.000817 0.278653 3 True
1 0.618036 26.825663 2 True
2 0.296744 25.471513 2 True
3 0.067484 57.022196 2 True
4 3.115024 54.594863 2 True
5 0.103688 0.029149 1 True
6 0.001143 0.733792 2 True
7 0.104609 0.032093 1 True
8 0.552854 11.122160 1 True
9 0.682034 6.536114 1 True
10 0.080831 200.143895 1 True
11 1.400994 31.543081 1 True
12 6.754477 65.577447 1 True
13 0.765329 74.589510 1 True
fit_order
0 14
1 12
2 11
3 13
4 10
5 2
6 9
7 1
8 5
9 7
10 6
11 4
12 3
13 8 }
predictor.leaderboard(silent=True).plot(kind="bar", x="model", y="score_val")
<AxesSubplot:xlabel='model'>
leaderboard = predictor.leaderboard(silent=True)
leaderboard["description"] = "001 basic features"
leaderboard.to_csv("leaderboard.csv", index=False)
leaderboard.head()
| model | score_val | pred_time_val | fit_time | pred_time_val_marginal | fit_time_marginal | stack_level | can_infer | fit_order | description | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | WeightedEnsemble_L3 | -53.114374 | 14.542919 | 553.766336 | 0.000817 | 0.278653 | 3 | True | 14 | 001 basic features |
| 1 | RandomForestMSE_BAG_L2 | -53.420043 | 11.062850 | 416.399112 | 0.618036 | 26.825663 | 2 | True | 12 | 001 basic features |
| 2 | LightGBM_BAG_L2 | -55.065570 | 10.741559 | 415.044961 | 0.296744 | 25.471513 | 2 | True | 11 | 001 basic features |
| 3 | CatBoost_BAG_L2 | -55.749057 | 10.512298 | 446.595644 | 0.067484 | 57.022196 | 2 | True | 13 | 001 basic features |
| 4 | LightGBMXT_BAG_L2 | -60.411326 | 13.559839 | 444.168311 | 3.115024 | 54.594863 | 2 | True | 10 | 001 basic features |
predictions = predictor.predict(test)
predictions.head()
0    23.152916
1    41.841251
2    45.808411
3    49.782307
4    52.052742
Name: count, dtype: float32
# Describe the `predictions` series to see if there are any negative values
predictions.describe()
count    6493.000000
mean      100.730713
std        89.761986
min         3.153492
25%        19.992170
50%        64.159775
75%       167.717422
max       365.000427
Name: count, dtype: float64
# How many negative values do we have?
print((predictions < 0).sum())
0
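Here there were no negatives, but if any did appear they could be counted and clipped in one vectorized call before building the submission. A minimal sketch with hypothetical values (not the notebook's actual predictions):

```python
import pandas as pd

# Hypothetical predictions with a couple of negative values (illustrative only)
preds = pd.Series([23.1, -4.2, 45.8, -0.5, 52.0], name="count")

# Vectorized count of negatives instead of a Python loop
n_negative = int((preds < 0).sum())

# Clip negatives to zero before writing a submission file
preds = preds.clip(lower=0)
```

Kaggle rejects negative counts for this competition, so the clip is a cheap safety net even when the check returns zero.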
# There are no negative values to clip, so assign the predictions to the submission directly
submission["count"] = predictions
submission.to_csv("submission.csv", index=False)
!kaggle competitions submit -c bike-sharing-demand -f submission.csv -m "first raw submission"
100%|█████████████████████████████████████████| 188k/188k [00:00<00:00, 376kB/s] Successfully submitted to Bike Sharing Demand
My Submissions
!kaggle competitions submissions -c bike-sharing-demand | tail -n +1 | head -n 6
fileName                     date                 description              status    publicScore  privateScore
---------------------------  -------------------  -----------------------  --------  -----------  ------------
submission.csv               2022-12-30 04:53:09  first raw submission     complete  1.79188      1.79188
submission_hpo.csv           2022-12-30 04:34:39  new features and hpo     complete  0.62542      0.62542
submission_new_features.csv  2022-12-30 04:16:09  model with new features  complete  0.60781      0.60781
submission_new_hpo.csv       2022-12-30 03:44:56  new features and hpo     complete  0.48505      0.48505
tail: write error: Broken pipe
# Score: 1.79188
# Create a histogram of all features to show the distribution of each one. This is part of the exploratory data analysis
train.hist(figsize=(12, 10))
plt.show()
# Create a new feature
train["hour"] = train["datetime"].dt.hour
test["hour"] = test["datetime"].dt.hour
train.head()
| datetime | season | holiday | workingday | weather | temp | atemp | humidity | windspeed | casual | registered | count | hour | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011-01-01 00:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 81 | 0.0 | 3 | 13 | 16 | 0 |
| 1 | 2011-01-01 01:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 8 | 32 | 40 | 1 |
| 2 | 2011-01-01 02:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 5 | 27 | 32 | 2 |
| 3 | 2011-01-01 03:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 75 | 0.0 | 3 | 10 | 13 | 3 |
| 4 | 2011-01-01 04:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 75 | 0.0 | 0 | 1 | 1 | 4 |
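Because the datetime column was parsed at load time, the same `dt`-accessor pattern could be extended to other components as well. A small sketch on an illustrative frame (the extra column names here are hypothetical, not features the notebook adds):

```python
import pandas as pd

# Illustrative frame; the notebook applies this to the parsed train/test datetime column
df = pd.DataFrame({"datetime": pd.to_datetime(["2011-01-01 05:00", "2011-01-02 17:00"])})

# Same pattern as the hour feature: pull more components from the parsed datetime
df["dayofweek"] = df["datetime"].dt.dayofweek  # Monday=0 ... Sunday=6
df["month"] = df["datetime"].dt.month
```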
# Profiler report
profile = ProfileReport(train)
profile.to_notebook_iframe()
# Visualizations
# Distribution of hourly bike demand by time features
train.groupby([train["datetime"].dt.month, "workingday"])["count"].median().unstack().plot(
kind='bar', title="Median of hourly bike demand by month (train data)")
train.groupby([train["datetime"].dt.hour, "workingday"])["count"].median().unstack().plot(
kind='bar', title="Median of hourly bike demand by hour (train data)")
train.groupby([train["datetime"].dt.dayofweek, "workingday"])["count"].median().unstack().plot(
kind='bar', title="Median of hourly bike demand by dayofweek (train data)")
plt.show()
train.groupby(["holiday"])["count"].median().plot(
kind='bar', title="Median of hourly bike demand by holiday (train data)")
plt.show()
# Distribution of hourly bike demand by weather features
train.groupby(["season", "workingday"])["count"].median().unstack().plot(
kind='bar', title="Median of hourly bike demand by season (train data)")
train.groupby(["weather", "workingday"])["count"].median().unstack().plot(
kind='bar', title="Median of hourly bike demand by weather (train data)")
train.groupby(["temp", "workingday"])["count"].median().unstack().plot(
kind='bar', title="Median of hourly bike demand by temp (train data)")
train.groupby(["atemp", "workingday"])["count"].median().unstack().plot(
kind='bar', title="Median of hourly bike demand by atemp (train data)")
train.groupby(["windspeed", "workingday"])["count"].median().unstack().plot(
kind='bar', title="Median of hourly bike demand by windspeed (train data)")
train.groupby(["humidity", "workingday"])["count"].median().unstack().plot(
kind='bar', title="Median of hourly bike demand by humidity (train data)")
plt.show()
# Distribution of events by time features
train["season"].value_counts().plot(
kind='bar', title="Number of events by season (train data)")
plt.show()
train["weather"].value_counts().plot(
kind='bar', title="Number of events by weather (train data)")
plt.show()
train["holiday"].value_counts().plot(
kind='bar', title="Number of events by holiday (train data)")
plt.show()
train["workingday"].value_counts().plot(
kind='bar', title="Number of events by workingday (train data)")
plt.show()
# Functions for generating new feature values
def get_daytime(hour):
    if 7 <= hour <= 9:
        return "morning"
    elif 12 <= hour <= 15:
        return "lunch"
    elif 16 <= hour <= 19:
        return "rush_hour"
    elif 20 <= hour <= 23:
        return "night"
    else:
        return "other"

def get_tempcat(temp):
    if temp >= 35:
        return "very hot"
    elif 25 <= temp < 35:
        return "hot"
    elif 15 <= temp < 25:
        return "warm"
    elif 10 <= temp < 15:
        return "cool"
    else:
        return "cold"

def get_windcat(windspeed):
    if windspeed > 20:
        return "windy"
    elif 10 < windspeed <= 20:
        return "mild"
    else:
        return "low"

def get_humiditycat(humidity):
    if humidity >= 80:
        return "high"
    elif 40 < humidity < 80:
        return "mild"
    else:
        return "low"
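The same threshold logic could alternatively be expressed with `pd.cut`, which bins a whole column in one call instead of applying a Python function row by row. A sketch assuming the wind-speed breakpoints above (not the notebook's actual approach):

```python
import pandas as pd

# Bin edges mirror get_windcat: <= 10 -> "low", (10, 20] -> "mild", > 20 -> "windy"
windspeed = pd.Series([5.0, 12.0, 26.0])
windcat = pd.cut(
    windspeed,
    bins=[-float("inf"), 10, 20, float("inf")],
    labels=["low", "mild", "windy"],
)
```

`pd.cut` also returns a categorical dtype directly, which would skip the later `astype("category")` step for these columns.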
# New features are generated
train["daytime"] = train['hour'].apply(get_daytime)
test['daytime'] = test['hour'].apply(get_daytime)
train['atempcat'] = train['atemp'].apply(get_tempcat)
test['atempcat'] = test['atemp'].apply(get_tempcat)
train['windcat'] = train['windspeed'].apply(get_windcat)
test['windcat'] = test['windspeed'].apply(get_windcat)
train['humiditycat'] = train['humidity'].apply(get_humiditycat)
test['humiditycat'] = test['humidity'].apply(get_humiditycat)
train.head()
| datetime | season | holiday | workingday | weather | temp | atemp | humidity | windspeed | casual | registered | count | hour | daytime | atempcat | windcat | humiditycat | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011-01-01 00:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 81 | 0.0 | 3 | 13 | 16 | 0 | other | cool | low | high |
| 1 | 2011-01-01 01:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 8 | 32 | 40 | 1 | other | cool | low | high |
| 2 | 2011-01-01 02:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 5 | 27 | 32 | 2 | other | cool | low | high |
| 3 | 2011-01-01 03:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 75 | 0.0 | 3 | 10 | 13 | 3 | other | cool | low | mild |
| 4 | 2011-01-01 04:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 75 | 0.0 | 0 | 1 | 1 | 4 | other | cool | low | mild |
train["daytime"].value_counts().plot(
kind='bar', title="Number of events by daytime (train data)")
plt.show()
train["atempcat"].value_counts().plot(
kind='bar', title="Number of events by atempcat (train data)")
plt.show()
train["windcat"].value_counts().plot(
kind='bar', title="Number of events by windcat (train data)")
plt.show()
train["humiditycat"].value_counts().plot(
kind='bar', title="Number of events by humiditycat (train data)")
plt.show()
category_list = ["season", "weather", "holiday", "workingday"]
train[category_list] = train[category_list].astype("category")
test[category_list] = test[category_list].astype("category")
new_category_list = ["daytime", "atempcat", "windcat", "humiditycat"]
train[new_category_list] = train[new_category_list].astype("category")
test[new_category_list] = test[new_category_list].astype("category")
# View the new feature
train.info()
test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 17 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   datetime     10886 non-null  datetime64[ns]
 1   season       10886 non-null  category
 2   holiday      10886 non-null  category
 3   workingday   10886 non-null  category
 4   weather      10886 non-null  category
 5   temp         10886 non-null  float64
 6   atemp        10886 non-null  float64
 7   humidity     10886 non-null  int64
 8   windspeed    10886 non-null  float64
 9   casual       10886 non-null  int64
 10  registered   10886 non-null  int64
 11  count        10886 non-null  int64
 12  hour         10886 non-null  int64
 13  daytime      10886 non-null  category
 14  atempcat     10886 non-null  category
 15  windcat      10886 non-null  category
 16  humiditycat  10886 non-null  category
dtypes: category(8), datetime64[ns](1), float64(3), int64(5)
memory usage: 851.9 KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6493 entries, 0 to 6492
Data columns (total 14 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   datetime     6493 non-null   datetime64[ns]
 1   season       6493 non-null   category
 2   holiday      6493 non-null   category
 3   workingday   6493 non-null   category
 4   weather      6493 non-null   category
 5   temp         6493 non-null   float64
 6   atemp        6493 non-null   float64
 7   humidity     6493 non-null   int64
 8   windspeed    6493 non-null   float64
 9   hour         6493 non-null   int64
 10  daytime      6493 non-null   category
 11  atempcat     6493 non-null   category
 12  windcat      6493 non-null   category
 13  humiditycat  6493 non-null   category
dtypes: category(8), datetime64[ns](1), float64(3), int64(2)
memory usage: 356.5 KB
# View histogram of all features again now with the hour feature
train.hist(figsize=(10, 8))
plt.show()
train.head()
| | datetime | season | holiday | workingday | weather | temp | atemp | humidity | windspeed | casual | registered | count | hour | daytime | atempcat | windcat | humiditycat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011-01-01 00:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 81 | 0.0 | 3 | 13 | 16 | 0 | other | cool | low | high |
| 1 | 2011-01-01 01:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 8 | 32 | 40 | 1 | other | cool | low | high |
| 2 | 2011-01-01 02:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 5 | 27 | 32 | 2 | other | cool | low | high |
| 3 | 2011-01-01 03:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 75 | 0.0 | 3 | 10 | 13 | 3 | other | cool | low | mild |
| 4 | 2011-01-01 04:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 75 | 0.0 | 0 | 1 | 1 | 4 | other | cool | low | mild |
# Fit model
learner_kwargs = {
"ignored_columns": ["casual", "registered"]
}
predictor_new_features = TabularPredictor(label="count", learner_kwargs=learner_kwargs, problem_type="regression",
eval_metric="root_mean_squared_error").fit(train_data=train, time_limit=600, presets="best_quality")
No path specified. Models will be saved in: "AutogluonModels/ag-20221230_045712/"
Presets specified: ['best_quality']
Stack configuration (auto_stack=True): num_stack_levels=1, num_bag_folds=8, num_bag_sets=20
Beginning AutoGluon training ... Time limit = 600s
AutoGluon will save models to "AutogluonModels/ag-20221230_045712/"
AutoGluon Version: 0.6.1
Python Version: 3.7.10
Operating System: Linux
Platform Machine: x86_64
Platform Version: #1 SMP Wed Oct 26 20:36:53 UTC 2022
Train Data Rows: 10886
Train Data Columns: 16
Label Column: count
Preprocessing data ...
Using Feature Generators to preprocess the data ...
Dropping user-specified ignored columns: ['casual', 'registered']
Fitting AutoMLPipelineFeatureGenerator...
Available Memory: 1772.55 MB
Train Data (Original) Memory Usage: 0.61 MB (0.0% of available memory)
Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
Stage 1 Generators:
Fitting AsTypeFeatureGenerator...
Note: Converting 2 features to boolean dtype as they only contain 2 unique values.
Stage 2 Generators:
Fitting FillNaFeatureGenerator...
Stage 3 Generators:
Fitting IdentityFeatureGenerator...
Fitting CategoryFeatureGenerator...
Fitting CategoryMemoryMinimizeFeatureGenerator...
Fitting DatetimeFeatureGenerator...
/usr/local/lib/python3.7/site-packages/autogluon/features/generators/datetime.py:59: FutureWarning: casting datetime64[ns, UTC] values to int64 with .astype(...) is deprecated and will raise in a future version. Use .view(...) instead.
good_rows = series[~series.isin(bad_rows)].astype(np.int64)
Stage 4 Generators:
Fitting DropUniqueFeatureGenerator...
Types of features in original data (raw dtype, special dtypes):
('category', []) : 8 | ['season', 'holiday', 'workingday', 'weather', 'daytime', ...]
('datetime', []) : 1 | ['datetime']
('float', []) : 3 | ['temp', 'atemp', 'windspeed']
('int', []) : 2 | ['humidity', 'hour']
Types of features in processed data (raw dtype, special dtypes):
('category', []) : 6 | ['season', 'weather', 'daytime', 'atempcat', 'windcat', ...]
('float', []) : 3 | ['temp', 'atemp', 'windspeed']
('int', []) : 2 | ['humidity', 'hour']
('int', ['bool']) : 2 | ['holiday', 'workingday']
('int', ['datetime_as_int']) : 5 | ['datetime', 'datetime.year', 'datetime.month', 'datetime.day', 'datetime.dayofweek']
0.4s = Fit runtime
14 features in original data used to generate 18 features in processed data.
Train Data (Processed) Memory Usage: 0.96 MB (0.1% of available memory)
Data preprocessing and feature engineering runtime = 0.47s ...
AutoGluon will gauge predictive performance using evaluation metric: 'root_mean_squared_error'
This metric's sign has been flipped to adhere to being higher_is_better. The metric score can be multiplied by -1 to get the metric value.
To change this, specify the eval_metric parameter of Predictor()
AutoGluon will fit 2 stack levels (L1 to L2) ...
Fitting 11 L1 models ...
Fitting model: KNeighborsUnif_BAG_L1 ... Training model for up to 399.59s of the 599.53s of remaining time.
-101.5462 = Validation score (-root_mean_squared_error)
0.04s = Training runtime
0.1s = Validation runtime
Fitting model: KNeighborsDist_BAG_L1 ... Training model for up to 399.2s of the 599.14s of remaining time.
-84.1251 = Validation score (-root_mean_squared_error)
0.03s = Training runtime
0.1s = Validation runtime
Fitting model: LightGBMXT_BAG_L1 ... Training model for up to 398.84s of the 598.78s of remaining time.
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
-35.1225 = Validation score (-root_mean_squared_error)
70.96s = Training runtime
6.39s = Validation runtime
Fitting model: LightGBM_BAG_L1 ... Training model for up to 321.87s of the 521.81s of remaining time.
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
-33.2233 = Validation score (-root_mean_squared_error)
55.37s = Training runtime
5.12s = Validation runtime
Fitting model: RandomForestMSE_BAG_L1 ... Training model for up to 261.33s of the 461.27s of remaining time.
-38.6807 = Validation score (-root_mean_squared_error)
13.91s = Training runtime
0.6s = Validation runtime
Fitting model: CatBoost_BAG_L1 ... Training model for up to 244.39s of the 444.33s of remaining time.
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
-34.7403 = Validation score (-root_mean_squared_error)
209.71s = Training runtime
0.2s = Validation runtime
Fitting model: ExtraTreesMSE_BAG_L1 ... Training model for up to 30.33s of the 230.27s of remaining time.
-37.9695 = Validation score (-root_mean_squared_error)
6.77s = Training runtime
0.58s = Validation runtime
Fitting model: NeuralNetFastAI_BAG_L1 ... Training model for up to 20.49s of the 220.43s of remaining time.
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
-78.2766 = Validation score (-root_mean_squared_error)
43.4s = Training runtime
0.57s = Validation runtime
Completed 1/20 k-fold bagging repeats ...
Fitting model: WeightedEnsemble_L2 ... Training model for up to 360.0s of the 172.69s of remaining time.
-32.1908 = Validation score (-root_mean_squared_error)
0.64s = Training runtime
0.0s = Validation runtime
Fitting 9 L2 models ...
Fitting model: LightGBMXT_BAG_L2 ... Training model for up to 171.95s of the 171.93s of remaining time.
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
-31.4643 = Validation score (-root_mean_squared_error)
29.67s = Training runtime
0.74s = Validation runtime
Fitting model: LightGBM_BAG_L2 ... Training model for up to 137.36s of the 137.34s of remaining time.
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
-30.5442 = Validation score (-root_mean_squared_error)
28.09s = Training runtime
0.56s = Validation runtime
Fitting model: RandomForestMSE_BAG_L2 ... Training model for up to 104.84s of the 104.82s of remaining time.
-31.5157 = Validation score (-root_mean_squared_error)
30.52s = Training runtime
0.66s = Validation runtime
Fitting model: CatBoost_BAG_L2 ... Training model for up to 71.39s of the 71.37s of remaining time.
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
-31.0877 = Validation score (-root_mean_squared_error)
70.74s = Training runtime
0.15s = Validation runtime
Completed 1/20 k-fold bagging repeats ...
Fitting model: WeightedEnsemble_L3 ... Training model for up to 360.0s of the -4.99s of remaining time.
-30.3377 = Validation score (-root_mean_squared_error)
0.42s = Training runtime
0.0s = Validation runtime
AutoGluon training complete, total runtime = 605.62s ... Best model: "WeightedEnsemble_L3"
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("AutogluonModels/ag-20221230_045712/")
predictor_new_features.fit_summary()
*** Summary of fit() ***
Estimated performance of each model:
model score_val pred_time_val fit_time pred_time_val_marginal fit_time_marginal stack_level can_infer fit_order
0 WeightedEnsemble_L3 -30.337686 15.769274 559.629112 0.001314 0.419274 3 True 14
1 LightGBM_BAG_L2 -30.544203 14.215940 428.281188 0.555478 28.091259 2 True 11
2 CatBoost_BAG_L2 -31.087709 13.813266 470.927791 0.152805 70.737862 2 True 13
3 LightGBMXT_BAG_L2 -31.464343 14.400542 429.859494 0.740081 29.669564 2 True 10
4 RandomForestMSE_BAG_L2 -31.515716 14.319597 430.711153 0.659136 30.521224 2 True 12
5 WeightedEnsemble_L2 -32.190822 12.409169 350.620753 0.001431 0.644369 2 True 9
6 LightGBM_BAG_L1 -33.223304 5.116870 55.371105 5.116870 55.371105 1 True 4
7 CatBoost_BAG_L1 -34.740254 0.202554 209.710110 0.202554 209.710110 1 True 6
8 LightGBMXT_BAG_L1 -35.122505 6.385618 70.961028 6.385618 70.961028 1 True 3
9 ExtraTreesMSE_BAG_L1 -37.969525 0.583711 6.771391 0.583711 6.771391 1 True 7
10 RandomForestMSE_BAG_L1 -38.680737 0.598819 13.906645 0.598819 13.906645 1 True 5
11 NeuralNetFastAI_BAG_L1 -78.276613 0.565712 43.403863 0.565712 43.403863 1 True 8
12 KNeighborsDist_BAG_L1 -84.125061 0.103877 0.027496 0.103877 0.027496 1 True 2
13 KNeighborsUnif_BAG_L1 -101.546199 0.103299 0.038291 0.103299 0.038291 1 True 1
Number of models trained: 14
Types of models trained:
{'WeightedEnsembleModel', 'StackerEnsembleModel_LGB', 'StackerEnsembleModel_CatBoost', 'StackerEnsembleModel_NNFastAiTabular', 'StackerEnsembleModel_XT', 'StackerEnsembleModel_KNN', 'StackerEnsembleModel_RF'}
Bagging used: True (with 8 folds)
Multi-layer stack-ensembling used: True (with 3 levels)
Feature Metadata (Processed):
(raw dtype, special dtypes):
('category', []) : 6 | ['season', 'weather', 'daytime', 'atempcat', 'windcat', ...]
('float', []) : 3 | ['temp', 'atemp', 'windspeed']
('int', []) : 2 | ['humidity', 'hour']
('int', ['bool']) : 2 | ['holiday', 'workingday']
('int', ['datetime_as_int']) : 5 | ['datetime', 'datetime.year', 'datetime.month', 'datetime.day', 'datetime.dayofweek']
Plot summary of models saved to file: AutogluonModels/ag-20221230_045712/SummaryOfModels.html
*** End of fit() summary ***
{'model_types': {'KNeighborsUnif_BAG_L1': 'StackerEnsembleModel_KNN',
'KNeighborsDist_BAG_L1': 'StackerEnsembleModel_KNN',
'LightGBMXT_BAG_L1': 'StackerEnsembleModel_LGB',
'LightGBM_BAG_L1': 'StackerEnsembleModel_LGB',
'RandomForestMSE_BAG_L1': 'StackerEnsembleModel_RF',
'CatBoost_BAG_L1': 'StackerEnsembleModel_CatBoost',
'ExtraTreesMSE_BAG_L1': 'StackerEnsembleModel_XT',
'NeuralNetFastAI_BAG_L1': 'StackerEnsembleModel_NNFastAiTabular',
'WeightedEnsemble_L2': 'WeightedEnsembleModel',
'LightGBMXT_BAG_L2': 'StackerEnsembleModel_LGB',
'LightGBM_BAG_L2': 'StackerEnsembleModel_LGB',
'RandomForestMSE_BAG_L2': 'StackerEnsembleModel_RF',
'CatBoost_BAG_L2': 'StackerEnsembleModel_CatBoost',
'WeightedEnsemble_L3': 'WeightedEnsembleModel'},
'model_performance': {'KNeighborsUnif_BAG_L1': -101.54619908446061,
'KNeighborsDist_BAG_L1': -84.12506123181602,
'LightGBMXT_BAG_L1': -35.1225045152507,
'LightGBM_BAG_L1': -33.22330386513564,
'RandomForestMSE_BAG_L1': -38.68073745703023,
'CatBoost_BAG_L1': -34.74025415038994,
'ExtraTreesMSE_BAG_L1': -37.9695247790059,
'NeuralNetFastAI_BAG_L1': -78.27661254918321,
'WeightedEnsemble_L2': -32.19082176546677,
'LightGBMXT_BAG_L2': -31.46434331779796,
'LightGBM_BAG_L2': -30.54420311983629,
'RandomForestMSE_BAG_L2': -31.515715795837146,
'CatBoost_BAG_L2': -31.08770851309625,
'WeightedEnsemble_L3': -30.337686490843712},
'model_best': 'WeightedEnsemble_L3',
'model_paths': {'KNeighborsUnif_BAG_L1': 'AutogluonModels/ag-20221230_045712/models/KNeighborsUnif_BAG_L1/',
'KNeighborsDist_BAG_L1': 'AutogluonModels/ag-20221230_045712/models/KNeighborsDist_BAG_L1/',
'LightGBMXT_BAG_L1': 'AutogluonModels/ag-20221230_045712/models/LightGBMXT_BAG_L1/',
'LightGBM_BAG_L1': 'AutogluonModels/ag-20221230_045712/models/LightGBM_BAG_L1/',
'RandomForestMSE_BAG_L1': 'AutogluonModels/ag-20221230_045712/models/RandomForestMSE_BAG_L1/',
'CatBoost_BAG_L1': 'AutogluonModels/ag-20221230_045712/models/CatBoost_BAG_L1/',
'ExtraTreesMSE_BAG_L1': 'AutogluonModels/ag-20221230_045712/models/ExtraTreesMSE_BAG_L1/',
'NeuralNetFastAI_BAG_L1': 'AutogluonModels/ag-20221230_045712/models/NeuralNetFastAI_BAG_L1/',
'WeightedEnsemble_L2': 'AutogluonModels/ag-20221230_045712/models/WeightedEnsemble_L2/',
'LightGBMXT_BAG_L2': 'AutogluonModels/ag-20221230_045712/models/LightGBMXT_BAG_L2/',
'LightGBM_BAG_L2': 'AutogluonModels/ag-20221230_045712/models/LightGBM_BAG_L2/',
'RandomForestMSE_BAG_L2': 'AutogluonModels/ag-20221230_045712/models/RandomForestMSE_BAG_L2/',
'CatBoost_BAG_L2': 'AutogluonModels/ag-20221230_045712/models/CatBoost_BAG_L2/',
'WeightedEnsemble_L3': 'AutogluonModels/ag-20221230_045712/models/WeightedEnsemble_L3/'},
'model_fit_times': {'KNeighborsUnif_BAG_L1': 0.03829073905944824,
'KNeighborsDist_BAG_L1': 0.0274960994720459,
'LightGBMXT_BAG_L1': 70.96102786064148,
'LightGBM_BAG_L1': 55.3711051940918,
'RandomForestMSE_BAG_L1': 13.906644821166992,
'CatBoost_BAG_L1': 209.71011018753052,
'ExtraTreesMSE_BAG_L1': 6.771391153335571,
'NeuralNetFastAI_BAG_L1': 43.40386343002319,
'WeightedEnsemble_L2': 0.6443688869476318,
'LightGBMXT_BAG_L2': 29.669564247131348,
'LightGBM_BAG_L2': 28.091259002685547,
'RandomForestMSE_BAG_L2': 30.521223545074463,
'CatBoost_BAG_L2': 70.73786163330078,
'WeightedEnsemble_L3': 0.41927385330200195},
'model_pred_times': {'KNeighborsUnif_BAG_L1': 0.10329937934875488,
'KNeighborsDist_BAG_L1': 0.10387706756591797,
'LightGBMXT_BAG_L1': 6.385617971420288,
'LightGBM_BAG_L1': 5.116869926452637,
'RandomForestMSE_BAG_L1': 0.5988190174102783,
'CatBoost_BAG_L1': 0.20255446434020996,
'ExtraTreesMSE_BAG_L1': 0.5837111473083496,
'NeuralNetFastAI_BAG_L1': 0.5657124519348145,
'WeightedEnsemble_L2': 0.0014307498931884766,
'LightGBMXT_BAG_L2': 0.7400805950164795,
'LightGBM_BAG_L2': 0.5554780960083008,
'RandomForestMSE_BAG_L2': 0.6591355800628662,
'CatBoost_BAG_L2': 0.15280485153198242,
'WeightedEnsemble_L3': 0.0013136863708496094},
'num_bag_folds': 8,
'max_stack_level': 3,
'model_hyperparams': {'KNeighborsUnif_BAG_L1': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True,
'use_child_oof': True},
'KNeighborsDist_BAG_L1': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True,
'use_child_oof': True},
'LightGBMXT_BAG_L1': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True},
'LightGBM_BAG_L1': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True},
'RandomForestMSE_BAG_L1': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True,
'use_child_oof': True},
'CatBoost_BAG_L1': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True},
'ExtraTreesMSE_BAG_L1': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True,
'use_child_oof': True},
'NeuralNetFastAI_BAG_L1': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True},
'WeightedEnsemble_L2': {'use_orig_features': False,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True},
'LightGBMXT_BAG_L2': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True},
'LightGBM_BAG_L2': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True},
'RandomForestMSE_BAG_L2': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True,
'use_child_oof': True},
'CatBoost_BAG_L2': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True},
'WeightedEnsemble_L3': {'use_orig_features': False,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True}},
'leaderboard': model score_val pred_time_val fit_time \
0 WeightedEnsemble_L3 -30.337686 15.769274 559.629112
1 LightGBM_BAG_L2 -30.544203 14.215940 428.281188
2 CatBoost_BAG_L2 -31.087709 13.813266 470.927791
3 LightGBMXT_BAG_L2 -31.464343 14.400542 429.859494
4 RandomForestMSE_BAG_L2 -31.515716 14.319597 430.711153
5 WeightedEnsemble_L2 -32.190822 12.409169 350.620753
6 LightGBM_BAG_L1 -33.223304 5.116870 55.371105
7 CatBoost_BAG_L1 -34.740254 0.202554 209.710110
8 LightGBMXT_BAG_L1 -35.122505 6.385618 70.961028
9 ExtraTreesMSE_BAG_L1 -37.969525 0.583711 6.771391
10 RandomForestMSE_BAG_L1 -38.680737 0.598819 13.906645
11 NeuralNetFastAI_BAG_L1 -78.276613 0.565712 43.403863
12 KNeighborsDist_BAG_L1 -84.125061 0.103877 0.027496
13 KNeighborsUnif_BAG_L1 -101.546199 0.103299 0.038291
pred_time_val_marginal fit_time_marginal stack_level can_infer \
0 0.001314 0.419274 3 True
1 0.555478 28.091259 2 True
2 0.152805 70.737862 2 True
3 0.740081 29.669564 2 True
4 0.659136 30.521224 2 True
5 0.001431 0.644369 2 True
6 5.116870 55.371105 1 True
7 0.202554 209.710110 1 True
8 6.385618 70.961028 1 True
9 0.583711 6.771391 1 True
10 0.598819 13.906645 1 True
11 0.565712 43.403863 1 True
12 0.103877 0.027496 1 True
13 0.103299 0.038291 1 True
fit_order
0 14
1 11
2 13
3 10
4 12
5 9
6 4
7 6
8 3
9 7
10 5
11 8
12 2
13 1 }
predictor_new_features.leaderboard(silent=True).plot(kind="bar", x="model", y="score_val")
<AxesSubplot:xlabel='model'>
# Save training/validation scores
leaderboard_nf = predictor_new_features.leaderboard(silent=True)
leaderboard_nf["description"] = "new features added"
leaderboard_nf.to_csv("leaderboard_new_features.csv", index=False)
leaderboard_nf.head()
| | model | score_val | pred_time_val | fit_time | pred_time_val_marginal | fit_time_marginal | stack_level | can_infer | fit_order | description |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | WeightedEnsemble_L3 | -30.337686 | 15.769274 | 559.629112 | 0.001314 | 0.419274 | 3 | True | 14 | new features added |
| 1 | LightGBM_BAG_L2 | -30.544203 | 14.215940 | 428.281188 | 0.555478 | 28.091259 | 2 | True | 11 | new features added |
| 2 | CatBoost_BAG_L2 | -31.087709 | 13.813266 | 470.927791 | 0.152805 | 70.737862 | 2 | True | 13 | new features added |
| 3 | LightGBMXT_BAG_L2 | -31.464343 | 14.400542 | 429.859494 | 0.740081 | 29.669564 | 2 | True | 10 | new features added |
| 4 | RandomForestMSE_BAG_L2 | -31.515716 | 14.319597 | 430.711153 | 0.659136 | 30.521224 | 2 | True | 12 | new features added |
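Each training run saves a leaderboard CSV tagged with a `description` column, so the per-run tables can be stacked into one comparison table for the writeup. A minimal sketch (toy frames standing in for the saved CSVs; `combine_leaderboards` is a hypothetical helper, assuming every run shares these columns):

```python
import pandas as pd

def combine_leaderboards(frames):
    """Concatenate per-run leaderboards and report the best (least
    negative) validation RMSE for each run's description tag."""
    combined = pd.concat(frames, ignore_index=True)
    return combined.groupby("description", sort=False)["score_val"].max()

run_a = pd.DataFrame({"model": ["WeightedEnsemble_L3", "LightGBM_BAG_L2"],
                      "score_val": [-30.34, -30.54],
                      "description": ["new features added"] * 2})
run_b = pd.DataFrame({"model": ["WeightedEnsemble_L3"],
                      "score_val": [-37.23],
                      "description": ["hpo"]})
print(combine_leaderboards([run_a, run_b]))
```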
predictions_nf = predictor_new_features.predict(test)
predictions_nf.head()
0    14.414740
1    10.229486
2     9.815841
3     8.203454
4     7.122305
Name: count, dtype: float32
predictions_nf.describe()
count    6493.000000
mean      161.998306
std       143.446411
min         2.800360
25%        48.185959
50%       126.332138
75%       229.810394
max       816.655762
Name: count, dtype: float64
# Remember to set all negative values to zero.
# Note: iterating and reassigning the loop variable (`for i in ...: i = 0`)
# does not modify the Series; use vectorized operations instead.
negatives = (predictions_nf < 0).sum()
predictions_nf = predictions_nf.clip(lower=0)
print(negatives)
0
submission_new_features = pd.read_csv("sampleSubmission.csv", parse_dates=["datetime"])
submission_new_features.head()
| | datetime | count |
|---|---|---|
| 0 | 2011-01-20 00:00:00 | 0 |
| 1 | 2011-01-20 01:00:00 | 0 |
| 2 | 2011-01-20 02:00:00 | 0 |
| 3 | 2011-01-20 03:00:00 | 0 |
| 4 | 2011-01-20 04:00:00 | 0 |
# Submit predictions, same as before
submission_new_features["count"] = predictions_nf.round(0).astype(int)
submission_new_features.to_csv("submission_new_features.csv", index=False)
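Before uploading, it is cheap to sanity-check the submission frame: the row count must match the test set and, as noted above, counts must be non-negative. A minimal sketch (`validate_submission` is a hypothetical helper; the demo frame is toy data, not the real submission):

```python
import pandas as pd

def validate_submission(df, expected_rows):
    """Basic checks on a Kaggle submission frame before upload."""
    assert list(df.columns) == ["datetime", "count"], "unexpected columns"
    assert len(df) == expected_rows, "row count mismatch"
    assert (df["count"] >= 0).all(), "negative predictions present"
    return True

demo = pd.DataFrame({
    "datetime": pd.to_datetime(["2011-01-20 00:00",
                                "2011-01-20 01:00",
                                "2011-01-20 02:00"]),
    "count": [14, 10, 9],
})
print(validate_submission(demo, expected_rows=3))  # True
```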
!kaggle competitions submit -c bike-sharing-demand -f submission_new_features.csv -m "model with new features"
100%|█████████████████████████████████████████| 149k/149k [00:00<00:00, 373kB/s]
Successfully submitted to Bike Sharing Demand
!kaggle competitions submissions -c bike-sharing-demand | tail -n +1 | head -n 6
fileName                     date                 description              status    publicScore  privateScore
---------------------------  -------------------  -----------------------  --------  -----------  ------------
submission_new_features.csv  2022-12-30 05:07:57  model with new features  complete  0.61186      0.61186
submission.csv               2022-12-30 04:53:09  first raw submission     complete  1.79188      1.79188
submission_hpo.csv           2022-12-30 04:34:39  new features and hpo     complete  0.62542      0.62542
submission_new_features.csv  2022-12-30 04:16:09  model with new features  complete  0.60781      0.60781
tail: write error: Broken pipe
# Score with one additional feature (hour): 0.67642
# Score with more features: 0.61186
Hyperparameter optimization is configured through the `hyperparameters` and `hyperparameter_tune_kwargs` arguments.
import autogluon.core as ag
#hyperparameters
nn_options = { # specifies non-default hyperparameter values for neural network models
'num_epochs': 10, # number of training epochs (controls training time of NN models)
'learning_rate': ag.space.Real(1e-4, 1e-2, default=5e-4, log=True), # learning rate used in training (real-valued hyperparameter searched on log-scale)
'activation': ag.space.Categorical('relu', 'softrelu', 'tanh'), # activation function used in NN (categorical hyperparameter, default = first entry)
'dropout_prob': ag.space.Real(0.0, 0.5, default=0.1), # dropout probability (real-valued hyperparameter)
}
gbm_options = { # specifies non-default hyperparameter values for lightGBM gradient boosted trees
'num_boost_round': 100, # number of boosting rounds (controls training time of GBM models)
'num_leaves': ag.space.Int(lower=26, upper=66, default=36), # number of leaves in trees (integer hyperparameter)
}
hyperparameters = { # hyperparameters of each model type
'GBM': gbm_options,
'NN_TORCH': nn_options, # NOTE: comment this line out if you get errors on Mac OSX
} # When these keys are missing from hyperparameters dict, no models of that type are trained
#hyperparameter_tune_kwargs
num_trials = 5 # try at most 5 different hyperparameter configurations for each type of model
search_strategy = 'auto' # to tune hyperparameters using random search routine with a local scheduler
hyperparameter_tune_kwargs = { # HPO is not performed unless hyperparameter_tune_kwargs is specified
'num_trials': num_trials,
'scheduler' : 'local',
'searcher': search_strategy,
}
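With `'searcher': 'auto'` and a local scheduler, AutoGluon samples up to `num_trials` configurations per model type from the declared spaces. A toy stand-in for that random searcher (plain Python, not AutoGluon's internal implementation; the objective is a made-up stand-in for validation RMSE):

```python
import math
import random

def random_search(num_trials, objective, seed=0):
    """Sample num_trials configs from spaces analogous to
    ag.space.Real(1e-4, 1e-2, log=True) and ag.space.Int(26, 66),
    keeping the configuration with the lowest objective value."""
    rng = random.Random(seed)
    best_cfg, best_score = None, float("inf")
    for _ in range(num_trials):
        cfg = {
            # log-uniform sample over [1e-4, 1e-2]
            "learning_rate": 10 ** rng.uniform(-4, -2),
            # uniform integer sample over [26, 66]
            "num_leaves": rng.randint(26, 66),
        }
        score = objective(cfg)
        if score < best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

# Made-up objective: pretend RMSE is minimized near learning_rate = 1e-3
toy_rmse = lambda cfg: abs(math.log10(cfg["learning_rate"]) + 3) + cfg["num_leaves"] / 100
best, score = random_search(num_trials=5, objective=toy_rmse)
print(best, score)
```

More trials explore the space more thoroughly at the cost of training time, which is why `num_trials = 5` here trades search quality for the notebook's 600-second budget.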
learner_kwargs = {
"ignored_columns": ["casual", "registered"]
}
predictor_new_hpo = TabularPredictor(label="count", learner_kwargs=learner_kwargs, problem_type="regression",
eval_metric="root_mean_squared_error").fit(
train_data=train,
time_limit=600,
num_stack_levels=3,
num_bag_folds=10,
num_bag_sets=20,
hyperparameters=hyperparameters,
hyperparameter_tune_kwargs=hyperparameter_tune_kwargs)
No model was trained during hyperparameter tuning NeuralNetTorch_BAG_L4... Skipping this model.
Completed 1/20 k-fold bagging repeats ...
Fitting model: WeightedEnsemble_L5 ... Training model for up to 360.0s of the 102.67s of remaining time.
-38.9944 = Validation score (-root_mean_squared_error)
0.42s = Training runtime
0.0s = Validation runtime
AutoGluon training complete, total runtime = 497.97s ... Best model: "WeightedEnsemble_L3"
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("AutogluonModels/ag-20221230_053934/")
predictor_new_hpo.fit_summary()
*** Summary of fit() ***
Estimated performance of each model:
model score_val pred_time_val fit_time pred_time_val_marginal fit_time_marginal stack_level can_infer fit_order
0 WeightedEnsemble_L3 -37.234105 0.001797 126.020105 0.001361 0.367549 3 True 7
1 LightGBM_BAG_L2/T3 -37.349089 0.000348 94.059461 0.000141 32.468658 2 True 6
2 LightGBM_BAG_L2/T2 -37.561850 0.000296 93.183897 0.000088 31.593095 2 True 5
3 LightGBM_BAG_L2/T1 -37.659894 0.000299 93.165067 0.000091 31.574265 2 True 4
4 WeightedEnsemble_L4 -37.847635 0.002242 252.986472 0.001394 0.589363 4 True 11
5 LightGBM_BAG_L3/T2 -37.933480 0.000627 188.578799 0.000100 31.351979 3 True 9
6 LightGBM_BAG_L3/T1 -37.998869 0.000615 189.135905 0.000088 31.909084 3 True 8
7 LightGBM_BAG_L3/T3 -38.185753 0.000661 189.136046 0.000133 31.909225 3 True 10
8 LightGBM_BAG_L1/T2 -38.539657 0.000103 30.861836 0.000103 30.861836 1 True 2
9 WeightedEnsemble_L2 -38.539657 0.003394 31.164896 0.003291 0.303059 2 True 3
10 WeightedEnsemble_L5 -38.994413 0.002591 349.357601 0.001472 0.418964 5 True 15
11 LightGBM_BAG_L4/T1 -39.101755 0.000936 284.917474 0.000088 32.520365 4 True 12
12 LightGBM_BAG_L4/T2 -39.170227 0.000946 283.458624 0.000098 31.061514 4 True 13
13 LightGBM_BAG_L4/T3 -39.225552 0.000934 285.356759 0.000085 32.959650 4 True 14
14 LightGBM_BAG_L1/T1 -40.206175 0.000104 30.728966 0.000104 30.728966 1 True 1
Number of models trained: 15
Types of models trained:
{'WeightedEnsembleModel', 'StackerEnsembleModel_LGB'}
Bagging used: True (with 10 folds)
Multi-layer stack-ensembling used: True (with 5 levels)
Feature Metadata (Processed):
(raw dtype, special dtypes):
('category', []) : 6 | ['season', 'weather', 'daytime', 'atempcat', 'windcat', ...]
('float', []) : 3 | ['temp', 'atemp', 'windspeed']
('int', []) : 2 | ['humidity', 'hour']
('int', ['bool']) : 2 | ['holiday', 'workingday']
('int', ['datetime_as_int']) : 5 | ['datetime', 'datetime.year', 'datetime.month', 'datetime.day', 'datetime.dayofweek']
Plot summary of models saved to file: AutogluonModels/ag-20221230_053934/SummaryOfModels.html
*** End of fit() summary ***
{'model_types': {'LightGBM_BAG_L1/T1': 'StackerEnsembleModel_LGB',
'LightGBM_BAG_L1/T2': 'StackerEnsembleModel_LGB',
'WeightedEnsemble_L2': 'WeightedEnsembleModel',
'LightGBM_BAG_L2/T1': 'StackerEnsembleModel_LGB',
'LightGBM_BAG_L2/T2': 'StackerEnsembleModel_LGB',
'LightGBM_BAG_L2/T3': 'StackerEnsembleModel_LGB',
'WeightedEnsemble_L3': 'WeightedEnsembleModel',
'LightGBM_BAG_L3/T1': 'StackerEnsembleModel_LGB',
'LightGBM_BAG_L3/T2': 'StackerEnsembleModel_LGB',
'LightGBM_BAG_L3/T3': 'StackerEnsembleModel_LGB',
'WeightedEnsemble_L4': 'WeightedEnsembleModel',
'LightGBM_BAG_L4/T1': 'StackerEnsembleModel_LGB',
'LightGBM_BAG_L4/T2': 'StackerEnsembleModel_LGB',
'LightGBM_BAG_L4/T3': 'StackerEnsembleModel_LGB',
'WeightedEnsemble_L5': 'WeightedEnsembleModel'},
'model_performance': {'LightGBM_BAG_L1/T1': -40.2061752974064,
'LightGBM_BAG_L1/T2': -38.53965666990459,
'WeightedEnsemble_L2': -38.53965666990459,
'LightGBM_BAG_L2/T1': -37.659894371651326,
'LightGBM_BAG_L2/T2': -37.561849723350264,
'LightGBM_BAG_L2/T3': -37.34908938630387,
'WeightedEnsemble_L3': -37.23410485308833,
'LightGBM_BAG_L3/T1': -37.99886917604992,
'LightGBM_BAG_L3/T2': -37.933479660418755,
'LightGBM_BAG_L3/T3': -38.18575313187197,
'WeightedEnsemble_L4': -37.84763528055502,
'LightGBM_BAG_L4/T1': -39.10175531610709,
'LightGBM_BAG_L4/T2': -39.17022731113232,
'LightGBM_BAG_L4/T3': -39.225551602020325,
'WeightedEnsemble_L5': -38.99441314595425},
'model_best': 'WeightedEnsemble_L3',
'model_paths': {'LightGBM_BAG_L1/T1': '/root/udacity_nd009t-c1-intro-to-ml-project-starter/AutogluonModels/ag-20221230_053934/models/LightGBM_BAG_L1/T1/',
'LightGBM_BAG_L1/T2': '/root/udacity_nd009t-c1-intro-to-ml-project-starter/AutogluonModels/ag-20221230_053934/models/LightGBM_BAG_L1/T2/',
'WeightedEnsemble_L2': 'AutogluonModels/ag-20221230_053934/models/WeightedEnsemble_L2/',
'LightGBM_BAG_L2/T1': '/root/udacity_nd009t-c1-intro-to-ml-project-starter/AutogluonModels/ag-20221230_053934/models/LightGBM_BAG_L2/T1/',
'LightGBM_BAG_L2/T2': '/root/udacity_nd009t-c1-intro-to-ml-project-starter/AutogluonModels/ag-20221230_053934/models/LightGBM_BAG_L2/T2/',
'LightGBM_BAG_L2/T3': '/root/udacity_nd009t-c1-intro-to-ml-project-starter/AutogluonModels/ag-20221230_053934/models/LightGBM_BAG_L2/T3/',
'WeightedEnsemble_L3': 'AutogluonModels/ag-20221230_053934/models/WeightedEnsemble_L3/',
'LightGBM_BAG_L3/T1': '/root/udacity_nd009t-c1-intro-to-ml-project-starter/AutogluonModels/ag-20221230_053934/models/LightGBM_BAG_L3/T1/',
'LightGBM_BAG_L3/T2': '/root/udacity_nd009t-c1-intro-to-ml-project-starter/AutogluonModels/ag-20221230_053934/models/LightGBM_BAG_L3/T2/',
'LightGBM_BAG_L3/T3': '/root/udacity_nd009t-c1-intro-to-ml-project-starter/AutogluonModels/ag-20221230_053934/models/LightGBM_BAG_L3/T3/',
'WeightedEnsemble_L4': 'AutogluonModels/ag-20221230_053934/models/WeightedEnsemble_L4/',
'LightGBM_BAG_L4/T1': '/root/udacity_nd009t-c1-intro-to-ml-project-starter/AutogluonModels/ag-20221230_053934/models/LightGBM_BAG_L4/T1/',
'LightGBM_BAG_L4/T2': '/root/udacity_nd009t-c1-intro-to-ml-project-starter/AutogluonModels/ag-20221230_053934/models/LightGBM_BAG_L4/T2/',
'LightGBM_BAG_L4/T3': '/root/udacity_nd009t-c1-intro-to-ml-project-starter/AutogluonModels/ag-20221230_053934/models/LightGBM_BAG_L4/T3/',
'WeightedEnsemble_L5': 'AutogluonModels/ag-20221230_053934/models/WeightedEnsemble_L5/'},
'model_fit_times': {'LightGBM_BAG_L1/T1': 30.728965759277344,
'LightGBM_BAG_L1/T2': 30.861836433410645,
'WeightedEnsemble_L2': 0.30305910110473633,
'LightGBM_BAG_L2/T1': 31.574265003204346,
'LightGBM_BAG_L2/T2': 31.59309482574463,
'LightGBM_BAG_L2/T3': 32.468658447265625,
'WeightedEnsemble_L3': 0.36754918098449707,
'LightGBM_BAG_L3/T1': 31.90908432006836,
'LightGBM_BAG_L3/T2': 31.35197901725769,
'LightGBM_BAG_L3/T3': 31.909225463867188,
'WeightedEnsemble_L4': 0.5893630981445312,
'LightGBM_BAG_L4/T1': 32.52036452293396,
'LightGBM_BAG_L4/T2': 31.061514377593994,
'LightGBM_BAG_L4/T3': 32.95964956283569,
'WeightedEnsemble_L5': 0.4189636707305908},
'model_pred_times': {'LightGBM_BAG_L1/T1': 0.00010442733764648438,
'LightGBM_BAG_L1/T2': 0.00010323524475097656,
'WeightedEnsemble_L2': 0.0032906532287597656,
'LightGBM_BAG_L2/T1': 9.107589721679688e-05,
'LightGBM_BAG_L2/T2': 8.797645568847656e-05,
'LightGBM_BAG_L2/T3': 0.00014066696166992188,
'WeightedEnsemble_L3': 0.0013608932495117188,
'LightGBM_BAG_L3/T1': 8.797645568847656e-05,
'LightGBM_BAG_L3/T2': 9.989738464355469e-05,
'LightGBM_BAG_L3/T3': 0.00013327598571777344,
'WeightedEnsemble_L4': 0.0013937950134277344,
'LightGBM_BAG_L4/T1': 8.7738037109375e-05,
'LightGBM_BAG_L4/T2': 9.751319885253906e-05,
'LightGBM_BAG_L4/T3': 8.535385131835938e-05,
'WeightedEnsemble_L5': 0.0014719963073730469},
'num_bag_folds': 10,
'max_stack_level': 5,
'model_hyperparams': {'LightGBM_BAG_L1/T1': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True},
'LightGBM_BAG_L1/T2': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True},
'WeightedEnsemble_L2': {'use_orig_features': False,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True},
'LightGBM_BAG_L2/T1': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True},
'LightGBM_BAG_L2/T2': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True},
'LightGBM_BAG_L2/T3': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True},
'WeightedEnsemble_L3': {'use_orig_features': False,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True},
'LightGBM_BAG_L3/T1': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True},
'LightGBM_BAG_L3/T2': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True},
'LightGBM_BAG_L3/T3': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True},
'WeightedEnsemble_L4': {'use_orig_features': False,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True},
'LightGBM_BAG_L4/T1': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True},
'LightGBM_BAG_L4/T2': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True},
'LightGBM_BAG_L4/T3': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True},
'WeightedEnsemble_L5': {'use_orig_features': False,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True}},
'leaderboard': model score_val pred_time_val fit_time \
0 WeightedEnsemble_L3 -37.234105 0.001797 126.020105
1 LightGBM_BAG_L2/T3 -37.349089 0.000348 94.059461
2 LightGBM_BAG_L2/T2 -37.561850 0.000296 93.183897
3 LightGBM_BAG_L2/T1 -37.659894 0.000299 93.165067
4 WeightedEnsemble_L4 -37.847635 0.002242 252.986472
5 LightGBM_BAG_L3/T2 -37.933480 0.000627 188.578799
6 LightGBM_BAG_L3/T1 -37.998869 0.000615 189.135905
7 LightGBM_BAG_L3/T3 -38.185753 0.000661 189.136046
8 LightGBM_BAG_L1/T2 -38.539657 0.000103 30.861836
9 WeightedEnsemble_L2 -38.539657 0.003394 31.164896
10 WeightedEnsemble_L5 -38.994413 0.002591 349.357601
11 LightGBM_BAG_L4/T1 -39.101755 0.000936 284.917474
12 LightGBM_BAG_L4/T2 -39.170227 0.000946 283.458624
13 LightGBM_BAG_L4/T3 -39.225552 0.000934 285.356759
14 LightGBM_BAG_L1/T1 -40.206175 0.000104 30.728966
pred_time_val_marginal fit_time_marginal stack_level can_infer \
0 0.001361 0.367549 3 True
1 0.000141 32.468658 2 True
2 0.000088 31.593095 2 True
3 0.000091 31.574265 2 True
4 0.001394 0.589363 4 True
5 0.000100 31.351979 3 True
6 0.000088 31.909084 3 True
7 0.000133 31.909225 3 True
8 0.000103 30.861836 1 True
9 0.003291 0.303059 2 True
10 0.001472 0.418964 5 True
11 0.000088 32.520365 4 True
12 0.000098 31.061514 4 True
13 0.000085 32.959650 4 True
14 0.000104 30.728966 1 True
fit_order
0 7
1 6
2 5
3 4
4 11
5 9
6 8
7 10
8 2
9 3
10 15
11 12
12 13
13 14
14 1 }
predictor_new_hpo.leaderboard(silent=True).plot(kind="bar", x="model", y="score_val")
<AxesSubplot:xlabel='model'>
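The bars above are negative because AutoGluon negates error metrics internally so that a higher `score_val` is always better. A small sketch (a few rows copied from the leaderboard above) that negates the scores back to plain RMSE before ranking:

```python
import pandas as pd

# AutoGluon reports score_val as negated RMSE (higher is better);
# flip the sign to recover the familiar "lower is better" RMSE
lb = pd.DataFrame({
    "model": ["WeightedEnsemble_L3", "LightGBM_BAG_L2/T3", "LightGBM_BAG_L1/T1"],
    "score_val": [-37.234105, -37.349089, -40.206175],
})
lb["rmse"] = -lb["score_val"]
best_model = lb.sort_values("rmse").iloc[0]["model"]
print(best_model)  # WeightedEnsemble_L3
```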
leaderboard_hpo = predictor_new_hpo.leaderboard(silent=True)
leaderboard_hpo["description"] = "hpo"
leaderboard_hpo.to_csv("leaderboard_hpo.csv", index=False)
leaderboard_hpo.head()
|   | model | score_val | pred_time_val | fit_time | pred_time_val_marginal | fit_time_marginal | stack_level | can_infer | fit_order | description |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | WeightedEnsemble_L3 | -37.234105 | 0.001797 | 126.020105 | 0.001361 | 0.367549 | 3 | True | 7 | hpo |
| 1 | LightGBM_BAG_L2/T3 | -37.349089 | 0.000348 | 94.059461 | 0.000141 | 32.468658 | 2 | True | 6 | hpo |
| 2 | LightGBM_BAG_L2/T2 | -37.561850 | 0.000296 | 93.183897 | 0.000088 | 31.593095 | 2 | True | 5 | hpo |
| 3 | LightGBM_BAG_L2/T1 | -37.659894 | 0.000299 | 93.165067 | 0.000091 | 31.574265 | 2 | True | 4 | hpo |
| 4 | WeightedEnsemble_L4 | -37.847635 | 0.002242 | 252.986472 | 0.001394 | 0.589363 | 4 | True | 11 | hpo |
predictions_hpo = predictor_new_hpo.predict(test)
predictions_hpo.head()
0    9.378200
1    6.759612
2    6.675080
3    6.639610
4    6.622023
Name: count, dtype: float32
predictions_hpo.describe()
count    6493.000000
mean      191.546890
std       173.236862
min         5.678326
25%        48.135056
50%       148.766876
75%       287.000244
max       869.541260
Name: count, dtype: float64
# Remember to set all negative values to zero
# (the original loop only rebound the loop variable, so it never changed the Series;
# count the negatives, then clip them in place)
print((predictions_hpo < 0).sum())
predictions_hpo = predictions_hpo.clip(lower=0)
0
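For reference, `Series.clip` does this replacement in a single vectorized call; a minimal self-contained sketch with toy prediction values (not the real predictions):

```python
import pandas as pd

# Toy predictions containing two negatives
preds = pd.Series([9.4, -1.2, 6.7, -0.3, 5.1])
n_negative = int((preds < 0).sum())   # how many values would be clipped
clipped = preds.clip(lower=0)         # negatives become 0, positives untouched
print(n_negative, clipped.tolist())   # 2 [9.4, 0.0, 6.7, 0.0, 5.1]
```

On this run the printed count was 0, so no values were actually changed, but the clip is still a cheap safeguard before submitting.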
submission_hpo = pd.read_csv("sampleSubmission.csv", parse_dates=["datetime"])
submission_hpo.head()
|   | datetime | count |
|---|---|---|
| 0 | 2011-01-20 00:00:00 | 0 |
| 1 | 2011-01-20 01:00:00 | 0 |
| 2 | 2011-01-20 02:00:00 | 0 |
| 3 | 2011-01-20 03:00:00 | 0 |
| 4 | 2011-01-20 04:00:00 | 0 |
# Submit predictions, following the same procedure as the earlier submissions
submission_hpo["count"] = predictions_hpo.round(0).astype(int)
submission_hpo.to_csv("submission_hpo.csv", index=False)
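Before submitting, it can help to sanity-check that the file matches the layout Kaggle expects (exactly a `datetime` and a `count` column, no negative counts). A self-contained sketch using a toy frame with hypothetical values, not the real submission:

```python
import io
import pandas as pd

# Toy frame mirroring the sampleSubmission layout (values are made up)
sub = pd.DataFrame({
    "datetime": pd.to_datetime(["2011-01-20 00:00", "2011-01-20 01:00", "2011-01-20 02:00"]),
    "count": [9, 7, 7],
})
assert list(sub.columns) == ["datetime", "count"]  # exactly the two expected columns
assert (sub["count"] >= 0).all()                   # no negative counts

# Render to CSV in memory and inspect the header row
buf = io.StringIO()
sub.to_csv(buf, index=False)
header = buf.getvalue().splitlines()[0]
print(header)  # datetime,count
```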
!kaggle competitions submit -c bike-sharing-demand -f submission_hpo.csv -m "new features and hpo"
100%|█████████████████████████████████████████| 149k/149k [00:00<00:00, 312kB/s] Successfully submitted to Bike Sharing Demand
!kaggle competitions submissions -c bike-sharing-demand | tail -n +1 | head -n 6
fileName                     date                 description              status    publicScore  privateScore
---------------------------  -------------------  -----------------------  --------  -----------  ------------
submission_hpo.csv           2022-12-30 05:47:57  new features and hpo     complete  0.48165      0.48165
submission_new_features.csv  2022-12-30 05:07:57  model with new features  complete  0.61186      0.61186
submission.csv               2022-12-30 04:53:09  first raw submission     complete  1.79188      1.79188
submission_hpo.csv           2022-12-30 04:34:39  new features and hpo     complete  0.62542      0.62542
tail: write error: Broken pipe
# Score: 0.48165
# Score (default hpo): 0.62542
# Take the top model score from each training run and create a line plot to show improvement
# You can create these in the notebook and save them to PNG or use some other tool (e.g. google sheets, excel)
fig = pd.DataFrame(
{
"model": ["initial", "add_features", "hpo"],
"score": [53.114374, 30.337686, 37.234105]
}
).plot(x="model", y="score", figsize=(8, 6)).get_figure()
fig.savefig('model_train_score.png')
# Take the 3 kaggle scores and create a line plot to show improvement
fig = pd.DataFrame(
{
"test_eval": ["initial", "add_features", "hpo"],
"score": [1.79188, 0.61186, 0.48165]
}
}
).plot(x="test_eval", y="score", figsize=(8, 6)).get_figure()
fig.savefig('model_test_score.png')
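To quantify the trend in the chart, the relative improvement between the first and the tuned submission can be computed directly from the Kaggle scores above (RMSLE, so lower is better):

```python
# Kaggle public scores from the submissions listing above (RMSLE, lower is better)
scores = {"initial": 1.79188, "add_features": 0.61186, "hpo": 0.48165}

# Fractional reduction in error from the first raw submission to the tuned one
improvement = (scores["initial"] - scores["hpo"]) / scores["initial"]
print(f"{improvement:.1%}")  # 73.1%
```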
# The 3 hyperparameters we tuned, with the kaggle score for each run as the result
pd.DataFrame({
"model": ["initial", "add_features", "hpo"],
"num_stack_levels": [1, 1, 3],
"num_bag_folds": [8, 8, 10],
"num_bag_sets": [20, 20, 20],
"score": [1.80502, 0.67176, 0.48165]
})
|   | model | num_stack_levels | num_bag_folds | num_bag_sets | score |
|---|---|---|---|---|---|
| 0 | initial | 1 | 8 | 20 | 1.80502 |
| 1 | add_features | 1 | 8 | 20 | 0.67176 |
| 2 | hpo | 3 | 10 | 20 | 0.48165 |
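The same comparison can be rebuilt programmatically, which is handy when regenerating the writeup table. A self-contained sketch reconstructing the frame above and selecting the best run:

```python
import pandas as pd

# Rebuild the comparison table from the values above; Kaggle's RMSLE is
# an error metric, so the lowest score identifies the best run
runs = pd.DataFrame({
    "model": ["initial", "add_features", "hpo"],
    "num_stack_levels": [1, 1, 3],
    "num_bag_folds": [8, 8, 10],
    "num_bag_sets": [20, 20, 20],
    "score": [1.80502, 0.67176, 0.48165],
})
best_run = runs.loc[runs["score"].idxmin(), "model"]
print(best_run)  # hpo
```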